How to Scrape Yellow Pages Using Python and LXML


In this web scraping tutorial, we will show you how to scrape Yellow Pages and extract business details based on a city and category.

For this scraper, we will search Yellowpages.com for restaurants in a city, then scrape business details from the first page of results.

What data are we extracting?

Here is the list of data fields we will be extracting:

Rank
Business Name
Phone Number
Business Page
Category
Website
Rating
Street Name
Locality
Region
Zip code
URL

Learn More: Scrape Yellow Pages for FREE using ScrapeHero Cloud

Below is a screenshot of the data we will be extracting from YellowPages.

[Screenshot: the business data fields we will extract from a Yellow Pages results page]

Read More – Learn to scrape Yelp business data

Finding the Data

First, we need to find where the data is located in the web page's HTML before we start building the Yellow Pages scraper. To do that, you'll need to understand the HTML tags in the page's content.

If you already understand HTML and Python, this will be simple for you. You don't need advanced programming skills for most of this tutorial.

If you don't know much about HTML or Python, spend some time reading Getting started with HTML – Mozilla Developer Network and https://www.programiz.com/python-programming.

Let’s inspect the HTML of the web page and find out where the data is located. Here is what we’re going to do:

Find the HTML tag that encloses the list of links we need the data from
Get the links from it and extract the data

Inspecting the HTML

Why should we inspect the element? To find any element on the web page using an XPath expression.

Open any browser (we are using Chrome here) and go to https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston

Right-click on any link on the page and choose Inspect Element. The browser will open a toolbar and show the HTML content of the web page in a well-structured format.

[GIF: inspecting a listing element in Chrome's developer tools]

The GIF above shows the data we need to extract inside a DIV tag. If you look closely, the DIV has a 'class' attribute with the value 'result'. This DIV contains the data fields we need to extract.

[Screenshot: the DIV with class 'result' that holds a single listing's data fields]

Now let's find the HTML tag(s) that hold the links we need to extract. You can right-click on the link title in the browser and choose Inspect Element again. It will open the HTML content like before and highlight the tag that holds the data you right-clicked on. In the GIF below, you can see the data fields structured in order.

[GIF: a listing's data fields shown in order in the HTML]
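Before writing the full scraper, it helps to confirm that your XPaths actually match something. Here is a minimal sketch using Requests and LXML — the selectors ('search-results organic', 'v-card', 'business-name') are the ones the full script below uses, and may need updating if Yellow Pages has changed its markup since this tutorial was written:

import requests
from lxml import html

# fetch one results page; a browser-like User-Agent avoids some blocks
url = "https://www.yellowpages.com/search?search_terms=restaurant&geo_location_terms=Boston"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'}
response = requests.get(url, headers=headers)
parser = html.fromstring(response.text)

# count the listing cards matched by the outer XPath
listings = parser.xpath("//div[@class='search-results organic']//div[@class='v-card']")
print(len(listings), "listings found")

# check that the inner XPath finds a business name in the first card
if listings:
    print(listings[0].xpath(".//a[@class='business-name']//text()"))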

Scrape Yellow Pages using an API end-point

How to set up your computer to scrape Yellow Pages

We will use Python 3 for this Yellow Pages scraping tutorial. The code will not run if you are using Python 2.7. To start, your system needs Python 3 and pip installed.

Most UNIX-like operating systems, such as Linux and macOS, come with Python pre-installed. But not all Linux distributions ship with Python 3 by default.

To check your Python version, open a terminal (in Linux and macOS) or Command Prompt (on Windows) and type:

python --version

Then press the Enter key. If the output looks something like Python 3.x.x, you have Python 3 installed. Likewise, if it's Python 2.x.x, you have Python 2. But if it prints an error, you probably don't have Python installed.

Install Python 3 and Pip

Here is a guide to install Python 3 in Linux – http://docs.python-guide.org/en/latest/starting/install3/linux/

Mac Users can follow this guide here – http://docs.python-guide.org/en/latest/starting/install3/osx/

Likewise, Windows Users go here – https://www.scrapehero.com/how-to-install-python3-in-windows-10/

Install Packages

Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)
Python LXML, for parsing the HTML tree structure using XPaths (learn how to install it here – http://lxml.de/installation.html)

The Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse

import requests
import unicodecsv as csv
from lxml import html


def parse_listing(keyword, place):
    """Process a Yellow Pages listing page.

    :param keyword: search query
    :param place: place name
    """
    url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    print("retrieving ", url)

    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Encoding': 'gzip, deflate, br',
               'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,ml;q=0.7',
               'Cache-Control': 'max-age=0',
               'Connection': 'keep-alive',
               'Host': 'www.yellowpages.com',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
               }

    # XPaths for each data field, relative to a single listing card (v-card)
    XPATH_LISTINGS = "//div[@class='search-results organic']//div[@class='v-card']"
    XPATH_BUSINESS_NAME = ".//a[@class='business-name']//text()"
    XPATH_BUSINESS_PAGE = ".//a[@class='business-name']//@href"
    XPATH_TELEPHONE = ".//div[@class='phones phone primary']//text()"
    XPATH_ADDRESS = ".//div[@class='info']//div//p[@itemprop='address']"
    XPATH_STREET = ".//div[@class='street-address']//text()"
    XPATH_LOCALITY = ".//div[@class='locality']//text()"
    XPATH_REGION = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='addressRegion']//text()"
    XPATH_ZIP_CODE = ".//div[@class='info']//div//p[@itemprop='address']//span[@itemprop='postalCode']//text()"
    XPATH_RANK = ".//div[@class='info']//h2[@class='n']/text()"
    XPATH_CATEGORIES = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='categories']//text()"
    XPATH_WEBSITE = ".//div[@class='info']//div[contains(@class,'info-section')]//div[@class='links']//a[contains(@class,'website')]/@href"
    XPATH_RATING = ".//div[@class='info']//div[contains(@class,'info-section')]//div[contains(@class,'result-rating')]//span//text()"

    # Retry a few times in case of temporary failures
    for retry in range(10):
        try:
            # verify=False skips SSL certificate checks, as in the original code
            response = requests.get(url, verify=False, headers=headers)
        except requests.exceptions.RequestException:
            print("Request failed, retrying")
            continue

        print("parsing page")
        if response.status_code == 200:
            parser = html.fromstring(response.text)
            # make all links absolute so business page URLs are usable as-is
            base_url = "https://www.yellowpages.com"
            parser.make_links_absolute(base_url)

            listings = parser.xpath(XPATH_LISTINGS)
            scraped_results = []

            for results in listings:
                raw_business_name = results.xpath(XPATH_BUSINESS_NAME)
                raw_business_telephone = results.xpath(XPATH_TELEPHONE)
                raw_business_page = results.xpath(XPATH_BUSINESS_PAGE)
                raw_categories = results.xpath(XPATH_CATEGORIES)
                raw_website = results.xpath(XPATH_WEBSITE)
                raw_rating = results.xpath(XPATH_RATING)
                # address = results.xpath(XPATH_ADDRESS)
                raw_street = results.xpath(XPATH_STREET)
                raw_locality = results.xpath(XPATH_LOCALITY)
                raw_region = results.xpath(XPATH_REGION)
                raw_zip_code = results.xpath(XPATH_ZIP_CODE)
                raw_rank = results.xpath(XPATH_RANK)

                business_name = ''.join(raw_business_name).strip() if raw_business_name else None
                telephone = ''.join(raw_business_telephone).strip() if raw_business_telephone else None
                business_page = ''.join(raw_business_page).strip() if raw_business_page else None
                rank = ''.join(raw_rank).replace('.\xa0', '') if raw_rank else None
                category = ','.join(raw_categories).strip() if raw_categories else None
                website = ''.join(raw_website).strip() if raw_website else None
                rating = ''.join(raw_rating).replace("(", "").replace(")", "").strip() if raw_rating else None
                street = ''.join(raw_street).strip() if raw_street else None
                locality = ''.join(raw_locality).replace(',\xa0', '').strip() if raw_locality else None

                # The locality text usually looks like "Boston, MA 02118"; split it
                # into city, region and zip code, and fall back to the itemprop
                # spans if the text is in a different format
                region, zipcode = None, None
                if locality and ',' in locality:
                    locality, _, rest = locality.partition(',')
                    parts = rest.split()
                    if len(parts) == 2:
                        region, zipcode = parts
                if not region:
                    region = ''.join(raw_region).strip() if raw_region else None
                if not zipcode:
                    zipcode = ''.join(raw_zip_code).strip() if raw_zip_code else None

                business_details = {
                    'business_name': business_name,
                    'telephone': telephone,
                    'business_page': business_page,
                    'rank': rank,
                    'category': category,
                    'website': website,
                    'rating': rating,
                    'street': street,
                    'locality': locality,
                    'region': region,
                    'zipcode': zipcode,
                    'listing_url': response.url
                }
                scraped_results.append(business_details)
            return scraped_results
        elif response.status_code == 404:
            print("Could not find a location matching", place)
            # no need to retry for a non-existent page
            break
        else:
            print("Failed to process page, retrying")
    return []


if __name__ == "__main__":
    argparser = argparse.ArgumentParser()
    argparser.add_argument('keyword', help='Search Keyword')
    argparser.add_argument('place', help='Place Name')
    args = argparser.parse_args()
    keyword = args.keyword
    place = args.place

    scraped_data = parse_listing(keyword, place)

    if scraped_data:
        print("Writing scraped data to %s-%s-yellowpages-scraped-data.csv" % (keyword, place))
        with open('%s-%s-yellowpages-scraped-data.csv' % (keyword, place), 'wb') as csvfile:
            fieldnames = ['rank', 'business_name', 'telephone', 'business_page', 'category',
                          'website', 'rating', 'street', 'locality', 'region', 'zipcode',
                          'listing_url']
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames, quoting=csv.QUOTE_ALL)
            writer.writeheader()
            for data in scraped_data:
                writer.writerow(data)
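Note that the script also imports unicodecsv, which it uses to write the CSV file in Python 3 and which the package list above does not mention. All three packages can typically be installed in one step with pip (use pip instead of pip3 if that is what your Python 3 installation provides):

pip3 install requests lxml unicodecsv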

To see the script's usage, type the script name followed by -h in a command prompt or terminal:

usage: yellow_pages.py [-h] keyword place

positional arguments:
  keyword     Search Keyword
  place       Place Name

optional arguments:
  -h, --help  show this help message and exit

Read More – Learn to scrape Nasdaq for stock market data

Learn to scrape Ebay product data

The positional argument keyword represents a business category, and place is the location to search in. As an example, let's find the business details for restaurants in Boston, MA. The script would be executed as:

python3 yellow_pages.py restaurants Boston,MA

You should see a file called restaurants-Boston,MA-yellowpages-scraped-data.csv in the same folder as the script, with the extracted data. Here is some sample data of the business details extracted from YellowPages.com for the command above.

[Screenshot: sample rows of the scraped business details]
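If you want to work with the exported data in Python afterwards, the standard library's csv module can read it straight back into dictionaries. A small sketch, assuming the filename from the example run above:

import csv

# read the scraped results back as one dictionary per row,
# keyed by the CSV header the scraper wrote
with open('restaurants-Boston,MA-yellowpages-scraped-data.csv', newline='', encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile):
        print(row['rank'], row['business_name'], row['telephone'])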

The data will be saved as a CSV file. You can download the code at https://github.com/scrapehero/yellowpages-scraper

Let us know in the comments how this code to scrape Yellowpages worked for you.

Known Limitations

This code should be capable of scraping the business details of most locations. But if you want to scrape Yellow Pages on a large scale, you should read How to build and run scrapers on a large scale and How to prevent getting blacklisted while scraping. One habit that helps on both counts is pacing your requests, as sketched below.
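The script above fires its retries back to back. A simple, polite improvement — not part of the original tutorial, just a sketch — is to wait a randomized, growing delay between attempts, which reduces load on the site and the chance of getting blocked:

import random
import time

import requests

def get_with_backoff(url, headers=None, max_retries=5):
    """Fetch a URL, waiting progressively longer between failed attempts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # network error; fall through to the delay and retry
        # exponential backoff with jitter: roughly 1s, 2s, 4s, ...
        delay = (2 ** attempt) + random.uniform(0, 1)
        print("retrying in %.1f seconds" % delay)
        time.sleep(delay)
    return None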

If you need professional help with scraping websites, contact us.


Disclaimer: Any code provided in our tutorials is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code. The mere presence of this code on our site does not imply that we encourage scraping or scrape the websites referenced in the code and accompanying tutorial. The tutorials only help illustrate the technique of programming web scrapers for popular internet websites. We are not obligated to provide any support for the code, however, if you add your questions in the comments section, we may periodically address them.

